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Abstract 

In many engineering and scientific applications, prediction variables are grouped, 
for example, in biological applications where assayed genes or proteins can be grouped 
by biological roles or biological pathways. Common statistical analysis methods such 
as ANOVA, factor analysis, and functional modeling with basis sets also exhibit nat- 
ural variable groupings. Existing successful group variable selection methods such as 
Antoniadis and Fan (2001), Yuan and Lin (2006) and Zhao, Rocha and Yu (2009) have 
the limitation of selecting variables in an "all-in-all-out" fashion, i.e., when one vari- 
able in a group is selected, all other variables in the same group arc also selected. In 
many real problems, however, we may want to keep the flexibility of selecting variables 
within a group, such as in gene-set selection. In this paper, we develop a new group 
variable selection method that not only removes unimportant groups effectively, but 
also keeps the flexibility of selecting variables within a group. We also show that the 
new method offers the potential for achieving the theoretical "oracle" property as in 
Fan and Li (2001) and Fan and Peng (2004). 

Keywords: Group selection; Lasso; Oracle property; Regularization; Variable selec- 
tion 
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1 Introduction 



Consider the usual regression situation: we have training data, {xi,yi), . . ., {Xi,yi), . . ., 
{"^ni Vn) 1 where = (xji, . . . jXip) are the predictors and Ui is the response. To model the 
response y in terms of the predictors Xi, . . . , Xp, one may consider the linear model: 

y = PQ + /3ixi + ... + /3pXp + e, (1) 

where e is the error term. In many important practical problems, however, prediction vari- 
ables are "grouped." For example, in ANOVA factor analysis, a factor may have several 
levels and can be expressed via several dummy variables, then the dummy variables corre- 
sponding to the same factor form a natural "group." Similarly, in additive models, each 
original prediction variable may be expanded into different order polynomials or a set of ba- 
sis functions, then these polynomials (or basis functions) corresponding to the same original 
prediction variable form a natural "group." Another example is in biological applications, 
where assayed genes or proteins can be grouped by biological roles (or biological pathways). 

For the rest of the paper, we assume that the prediction variables can be divided into 
K groups and the kth group contains pk variables. Specifically, the linear model ([1]) is now 
written as 

Vi = Po + ^^l3kjXi,kj + £i- (2) 
k=i j=i 

And we are interested in finding out which variables, especially which "groups," have an im- 
portant effect on the response. For example, (xn, . . . , XipJ, (x2i, . . . , 3:2^2)! • • •? {xki, • • • , ^Kp^) 
may represent different biological pathways, y may represent a certain phenotype and we are 
interested in deciphering which and how these biological pathways "work together" to affect 
the phenotype. 

There are two important challenges in this problem: prediction accuracy and interpre- 
tation. We would like our model to accurately predict on future data. Prediction accuracy 
can often be improved by shrinking the regression coefficients. Shrinkage sacrifices some bias 
to reduce the variance of the predicted value and hence may improve the overall prediction 
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accuracy. Interpretability is often realized via variable selection. With a large number of 
prediction variables, we often would like to determine a smaller subset that exhibits the 
strongest effects. 

Variable selection has been studied extensively in the literature, for example, see George 
and McCuUoch (1993), Breiman (1995), Tibshirani (1996), George and Foster (2000), Fan 
and Li (2001), Zou and Hastie (2005), Lin and Zhang (2006) and Wu, Boos and Stefanski 
(2007). In particular, lasso (Tibshirani 1996) has gained much attention in recent years. The 
lasso criterion penalizes the Li-norm of the regression coefficients to achieve a sparse model: 



where A > is a tuning parameter. Note that by location transformation, we can always 
assume that the predictors and the response have mean 0, so we can ignore the intercept in 
equation 

Due to the singularity at Pkj = 0, the Li-norm penalty can shrink some of the fitted 
coefficients to be exact zero when making the tuning parameter sufficiently large. However, 
lasso and other methods above are for the case when the candidate variables can be treated 
individually or "flatly." When variables are grouped, ignoring the group structure and 
directly applying lasso as in (|3]) may be sub-optimal. For example, suppose the kth group is 
unimportant, then lasso tends to make individual estimated coefficients in the kth group to 
be zero, rather than the whole group to be zero, i.e., lasso tends to make selection based on 
the strength of individual variables rather than the strength of the group, often resulting in 
selecting more groups than necessary. 

Antoniadis and Fan (2001), Yuan and Lin (2006) and Zhao, Rocha and Yu (2009) have 
addressed the group variable selection problem in the literature. Antoniadis and Fan (2001) 
proposed to use a blockwise additive penalty in the setting of wavelet approximations. To 
increase the estimation precision, empirical wavelet coefficients were thresholded or shrunken 
in blocks (or groups) rather than individually. 

Yuan and Lin (2006) and Zhao, Rocha and Yu (2009) extended the lasso model ([3]) 
for group variable selection. Yuan and Lin (2006) chose to penalize the L2-norm of the 




(3) 
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coefficients within each group, i.e., J2k=i ll/^fclb) where 

11/3.112 = + + (4) 

Due to the singularity of ||/3^,||2 at f3f, = 0, appropriately tuning A can set the whole coefficient 
vector (3j^ = 0, hence the kth group is removed from the ffited model. We note that in the 
setting of wavelet analysis, this method reduces to Antoniadis and Fan (2001). 

Instead of using the L2-norm penalty, Zhao, Rocha and Yu (2009) suggested using the 
Loo-norm penalty, i.e., J2k=i W^kWoo, where 

||/3fc||oo = max(|/3fci|, \/3k2\, • • • , IPkpJ)- (5) 

Similar to the L2-norm, the Loo-norm of is also singular when (3j^ = 0; hence when A is 
appropriately tuned, the Loo-norm can also effectively remove unimportant groups. 

However, there are some possible limitations with these methods: Both the L2-norm 
penalty and the Loo-norm penalty select variables in an "all-in-all-out" fashion, i.e., when 
one variable in a group is selected, all other variables in the same group are also selected. 
The reason is that both ||/3fc||2 and ||/3^||oo are singular only when the whole vector = 0. 
Once a component of f3j^ is non-zero, the two norm functions are no longer singular. This can 
also be heuristically understood as the following: for the L2-norm (|4]), it is the ridge penalty 
that is under the square root; since the ridge penalty can not do variable selection (as in 
ridge regression), once the L2-norm is non-zero (or the corresponding group is selected), all 
components will be non-zero. For the Loo-norm if the "max(-)" is non-zero, there is no 
increase in the penalty for letting all the individual components move away from zero. Hence 
if one variable in a group is selected, all other variables are also automatically selected. 

In many important real problems, however, we may want to keep the flexibility of selecting 
variables within a group. For example, in the gene-set selection problem, a biological pathway 
may be related to a certain biological process, but it does not necessarily mean all the 
genes in the pathway are all related to the biological process. We may want to not only 
remove unimportant pathways effectively, but also identify important genes within important 
pathways. 
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For the Loo-norm penalty, another possible limitation is that the estimated coefficients 
within a group tend to have the same magnitude, i.e. \(3ki\ = |/3fc2| = . . . = |/3fcpj.|; and this 
may cause some serious bias, which jeopardizes the prediction accuracy. 

In this paper, we propose an extension of lasso for group variable selection, which we call 
hierarchical lasso (HLasso). Our method not only removes unimportant groups effectively, 
but also keeps the flexibility of selecting variables within a group. Furthermore, asymptotic 
studies motivate us to improve our model and show that when the tuning parameter is 
appropriately chosen, the improved model has the oracle property (Fan and Li 2001, Fan 
and Peng 2004), i.e., it performs as well as if the correct underlying model were given in 
advance. Such a theoretical property has not been previously studied for group variable 
selection at both the group level and the within group level. 

The rest of the paper is organized as follows. In Section [2], we introduce our new method: 
the hierarchical lasso. We propose an algorithm to compute the solution for the hierarchical 
lasso in Section [31 In Sections H] and |5l we study the asymptotic behavior of the hierarchical 
lasso and propose an improvement for the hierarchical lasso. Numerical results are in Sections 
[6] and U\ and we conclude the paper with Section El 

2 Hierarchical Lasso 

In this section, we extend the lasso method for group variable selection so that we can 
effectively remove unimportant variables at both the group level and the within group level. 
We reparameterize (3kj as 

/3kj = dkakj, k = 1, . . . ,K; j = 1, . . . ,pk, (6) 

where dk > (for identifiability reasons). This decomposition reflects the information that 
Pkj,j = 1, . . . ,Pk, all belong to the kh group, by treating each Pkj hierarchically, dk is at the 
ffist level of the hierarchy, controlling f3kj,j = 1, . . . ,Pfc, as a group; ak/s are at the second 
level of the hierarchy, reflecting differences within the kth group. 
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For the purpose of variable selection, we consider the following penalized least squares 
criterion: 



max 

dk,OLkj 



i=i \ k=i j=i / 

K K Pk 



-Ai ■ ^4 - A2 ■ Ittfcjl (7) 

k=l k=l j=l 

subject to dk > 0, k = 1, . . . , K, 

where Ai > and A2 > are tuning parameters. Ai controls the estimates at the group 
level, and it can effectively remove unimportant groups: if 4 is shrunken to zero, all Pkj in 
the fcth group will be equal to zero. A2 controls the estimates at the variable-specific level: 
if dk is not equal to zero, some of the akj hence some of the /3kj,j = I, . . . ,pk, still have the 
possibility of being zero; in this sense, the hierarchical penalty keeps the flexibility of the 
Li-norm penalty. 

One may complain that such a hierarchical penalty may be more complicated to tune 
in practice, however, it turns out that the two tuning parameters Ai and A2 in ([7j) can be 
simplified into one. Specifically, let A = Ai ■ A2, we can show that ([7]) is equivalent to 

I ^ / K Pk K K Pk 

max -:^J2\y'~^^^^^^^^''^A 

'""''^ i=i \ k=i j=i J k=i k=i j=i 

subject to dk > 0, k = 1, . . . , K. 

Lemma 1 Let (d ,61*) he a local maximizer of then there exists a local maximizer 
(d ,a*) of such that dlalj = dkO^kj- Similarly, if (d , ck* j is a local maximizer of 
there exists a local maximizer (d , a* ) of ^ such that dl&lj = dlalj. 

The proof is in the Appendix. This lemma indicates that the final fitted models from 
([7]) and ([H]) are the same, although they may provide different dk and akj- This also implies 
that in practice, we do not need to tune Ai and A2 separately; we only need to tune one 
parameter A = Ai ■ A2 as in ([8]). 



3 Algorithm 

To estimate the and a^j in ([8]), we can use an iterative approach, i.e., we first fix and 
estimate akj, then we fix akj and estimate d^, and we iterate between these two steps until 
the solution converges. Since at each step, the value of the objective function (jS]) decreases, 
the solution is guaranteed to converge. 

When dk is fixed, ([8]) becomes a lasso problem, hence we can use either the LAR/LASSO 
algorithm (Efron, Hastie, Johnstone and Tibshirani 2004) or a quadratic programming pack- 
age to efficiently solve for akj. When akj is fixed, (jS]) becomes a non- negative garrote problem. 
Again, we can use either an efficient solution path algorithm or a quadratic programming 
package to solve for dk- In summary, the algorithm proceeds as follows: 

1. (Standardization) Center y. Center and normalize Xkj. 

2. (Initialization) Initialize and af'j with some plausible values. For example, we can 
set d^^^ = 1 and use the least squares estimates or the simple regression estimates by 
regressing the response y on each of the Xkj for afj. Let ^'^^ = df^a^,^j and m = 1. 



3. (Update akj) Let 



then 



S^i,kj dj^ Xi,kji k 1, . . . , K, j 1, . . . ,Pki 



K Pk \ 2 K Pk 



' = max - - ^ I ?/i - ^ ^ akjii,kj ] - A ^ ^ | a^^- 1 . 
4. (Update dk) Let 



(m) 

ChL^ liiaA 

«fc, 2 • , 

i=l \ k=l j=l / k=l j=l 



Pk 

Xi^k = ctfcj Xi,ki^ k = 1, . . . , K, 



then 



n / K \'^K 



4™^ = arg max ( ^ ^'^^^''^ ) " 4- 



i=l \ k=l / k=\ 

5. (Update fikj) Let 

Pkj — "fc "fcj • 
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6. If 11/3(7) is small enough, stop the algorithm. Otherwise, let m ^ m + 1 and 
go back to Step 3. 



3.1 Orthogonal Case 

To gain more insight into the hierarchical penalty, we have also studied the algorithm in 
the orthogonal design case. This can be useful, for example, in the wavelet setting, where 
each Xkj corresponds to a wavelet basis function, different k may correspond to different 
"frequency" scales, and different j with the same k correspond to different "time" locations. 
Specifically, suppose x^-x^j = 1 and x\-Xi^iji = if 7^ /c' or j 7^ j', then Step 3 and Step 4 
in the above algorithm have closed form solutions. 

Let = x\jy be the ordinary least squares solution when orthonormal to each 

other. 



Step 3. When dk is fixed, 



„i7) = I(4"-''>O).=gn0°h-(^-^^p^| . (9) 



Step 4. When auj is fixed 



Vk ('^'("'^2 /3ols 

4'") = i(3j,47Vo) - (y , T\ -T^\ ■ (10) 



Equations ([9]) and ( ITOl) show that both d/^ and are soft-thresholding estimates. Here 
we provide some intuitive explanation. 

We first look at a"^^ in equation ([9]). If rf^™'"^'' = 0, it is natural to estimate all Ukj to 
be zero because of the penalty on akj- If c^i™"^'' > 0, then from our reparametrization, we 
have akj = hj/d^r~^\ j = ^,---,Pk- Plugging in for [5kj, we obtain otkj = Pkj^ / d^iT"^^ ■ 
Equation iQ shrinks akj, and the amount of shrinkage is inversely proportional to (d^.'" ^))^. 
So when c?^™" is large, which indicates the kth group is important, the amount of shrinkage 
is small, while when c?^™ is small, which indicates the kth group is less important, the 
amount of shrinkage is large. 



Now considering d^^' in equation ( ITOl) . If all a^™ are zero, it is natural to estimate di' 
also to be zero because of the penalty on d^. If not all a^^^ are 0, say a^}^^ , . . . , a^.^"* are 
not zero, then we have d^ = f3kjja^f^^ ,1 < s < r. Again, plugging in for f3kj^, we 
obtain r estimates for d^: dk = < s < r. A natural estimate for d^ is then 

a weighted average of the dk, and equation (fTOl) provides such a (shrunken) average, with 
weights proportional to (ai-T'')^- 



4 Asymptotic Theory 

In this section, we explore the asymptotic behavior of the hierarchical lasso method. 

The hierarchical lasso criterion (|8]) uses d^ and akj- We first show that it can also be 
written in an equivalent form using the original regression coefficients /3kj- 

Theorem 1 If (d,a.) is a local maximizer of then ^, where Pkj = dkCtkj, is a local 
maximizer of 

2 



max 



i=l \ k=l j=l / 



K 



2^ • X V l^^il + l^'^^l + . . . + \Pkp,\. [n] 



k=l 



On the other hand, if (3 is a local maximizer of 01]) . then we define (d,cx.), where d 



k 



0,a.k = if W^kWi = 0; ^'^^ dk = \ M\^k\\i-, = I ^- if ^ 0. Then the so-defined 

(d,a.) is a local maximizer of 

Note that the penalty term in ffTTl) is similar to the L2-norm penalty (jl]), except that 
under each square root, we now penalize the Li-norm of (3^, rather than the sum of squares. 
However, unlike the L2-norm, which is singular only at the point (3^ = 0, (i.e., the whole 
vector is equal to 0), the square root of the Li-norm is singular at (3kj = no matter what 
are the values of other /3fcj's. This explains, from a different perspective, why the hierarchical 
lasso can remove not only groups, but also variables within a group even when the group is 
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selected. Equation (fTT]) also implies that the hierarchical lasso belongs to the "CAP" family 
in Zhao, Rocha and Yu (2009). 

We study the asymptotic properties allowing the total number of variables Pn, as well as 
the number of groups Kn and the number of variables within each group pnk, to go to oo, 
where P„ = ^^iPnfc- Note that we add a subscript "n" to K and pk to denote that these 
quantities can change with n. Accordingly, /3, i/i and Xi^kj are also changed to /3„, yni and 
Xni,kj- We write 2\/X in ( ITTj) as nA„, and the criterion ffTTj) is re- written as 

niaX — — ^ ^ j — ^ ^ ^ ^ Xni,kj(3n,kj 

^ i=\ \ k=\ j=l 

k=l 



An ■ Y |/3ji,fcl| + \Pn,k2\ + • • • + \Pn,kp„k\- (12) 



Our asymptotic analysis in this section is based on the criterion f lT2|) . 
Let /3° = (/3j^^, /3°^)^ be the underlying true parameters, where 

An = {(fc,j):/3°fc,7^0}, 

Bn = {(A;,j):/3°fc, = 0,/3°,^0}, 

C„ = {(A;,j):/3°, = 0}, 

= i3„UC,. (13) 

Note that An contains the indices of coefficients which are truly non-zero, C„ contains the 
indices where the whole "groups" are truly zero, and Bn contains the indices of zero coef- 
ficients, but they belong to some non-zero groups. So An, Bn and Cn are disjoint and they 
partition all the indices. We have the following theorem. 

Theorem 2 // ^/n\n = 0(1), then there exists a root-(n/Pn) consistent local maximizer 
= i^An^^Bn^^cS of ^E), and if also PnU'^/^/Xn as n oo, then PT0Cn = 0) 

1. 

Theorem [2] implies that the hierarchical lasso method can effectively remove unimportant 
groups. For the above root-(r;,/P„) consistent estimate, however, if 7^ (empty set), then 



9 



Pr(/3g^^ = 0) —7- 1 is not always true. This means that although the hierarchical lasso method 
can effectively remove all unimportant groups and some of the unimportant variables within 
important groups, it cannot effectively remove all unimportant variables within important 
groups. 

In the next section, we improve the hierarchical lasso method to tackle this limitation. 



5 Adaptive Hierarchical Lasso 

To improve the hierarchical lasso method, we apply the adaptive idea which has been used 
in Breiman (1995), Wang, Li and Tsai (2006), Zhang and Lu (2007), and Zou (2006), i.e., 
to penalize different coefficients differently. Specifically, we consider 



^ n / Kn Pk ^ ^ 

max ~ 9 X] ~ X] X] ^rii,kjl^n,k 

^"'"^ ^ 1=1 \ k=l 3=1 



-nXn ■ ^ ^Jwn,kl\Pn,kl \ + ^n,fc2 1 /3n,fc2 1 + • • • + Wn,kpk\(3n,kp„k\y (14) 
k=l 

where Wn,kj are pre-specified weights. The intuition is that if the effect of a variable is strong, 

we would like the corresponding weight to be small, hence the corresponding coefficient is 

lightly penalized. If the effect of a variable is not strong, we would like the corresponding 

weight to be large, hence the corresponding coefficient is heavily penalized. In practice, we 

may consider using the ordinary least squares estimates or the ridge regression estimates to 

help us compute the weights, for example, 

1 1 

"|7 ■ 



Wnki = — or Wnki = ^r^l 1 (15) 



where 7 is a positive constant. 

5.1 Oracle Property 
Problem Setup 

Since the theoretical results we develop for (HM are not restricted to the squared error loss, 
for the rest of the section, we consider the generalized linear model. For generalized linear 
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models, statistical inferences are based on underlying likelihood functions. We assume that 
the data Vni = (^m, Yni), i = 1, . . . ,n are independent and identically distributed for every 
n. Conditioning on Xni = Xni, Yni has a density fn{gn{xj^if3n),yni), where gn{-) is a known 
link function. We maximize the penalized log-likelihood 

max Qn{f3n) = Ln{/3J - J„(/3„) 

Pn , kj 

n K 

= ^Lign{xl.f3J,yni) -n^px„,n,„i(3nk), (16) 
i=l k=l 

where £n{-, ■) = log/„(-, ■) denotes the conditional log-likelihood of Y, and 

-|- . . . -|- U)n^kpf^\f^n,kp„k\- 

Note that under the normal distribution, in{gn{Xnil3n),yni) = — ^c^^"^ + C*2, hence 
ffT6|) reduces to (flil). 

The asymptotic properties of flTBl) are described in the following theorems, and the proofs 
are in the Appendix. We note that the proofs follow the spirit of Fan and Li (2001) and 
Fan and Peng (2004), but due to the grouping structure and the adaptive weights, they are 
non-trivial extensions of Fan and Li (2001) and Fan and Peng (2004). 

To control the adaptive weights, we define: 

an = max{Wn,kj ■ l^n,kj 7^ 0}) 
hn = mm{Wn,kj ■■ Pn,kj = 0}- 

We assume that 

< ci < min{/3°_^^. : /3° ^ 0} < max{/3° fc^- : (3^^^ ^ 0} < Cs < oo. 
Then we have the following results. 

Theorem 3 For every n, the observations {Vni, i = 1,2, . . . ,n} are independent and iden- 
tically distributed, each with a density /«(V„i,/3^) that satisfies conditions (A1)-(A3) in 
the Appendix. // — )■ and P^Xny/(h2 = Op{l), then there exists a local maximizer (3n of 
QniPn) such that - /3°|| = Op(v^(n-i/2 ^ A„y^)). 
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Hence by choosing A„y^ = Op{n ^/^), there exists a root-(n/P„) consistent penahzed 
hkehhood estimate. 

Theorem 4 For every n, the observations {V„i, i = 1, 2, . . . , n} are independent and iden- 
tically distributed, each with a density fn{Vni, (^n) that satisfies conditions (A1)-(A3) in the 
Appendix. If ^ 0, A„y^ = Op(n^^/^) and — Op{n), then there exists a root-{n/ Pn) 
consistent local maximizer /3„ such that: 

(a) Sparsity: Pr(^„.p^ = 0) ^ 1, where !)„ = fi„ UC„. 

(b) Asymptotic normality: If \n\/a^ = Op((nP„)~'^^^) and ^ ^ as n ^ oo, then we also 
have: 

V^A^ll/\^%J0,^^^ - ^l^J ^ Af{0, G), 

where An is a qx \An\ matrix such that A^AJ^ G and G is a qxq nonnegative symmetric 
matrix. J„(/3°_^^) is the Fisher information matrix knowing (3^^ — 0. 

The above requirements A„^/a^ = Op((nP„)~^^^) and = Op{n) as n — >■ oo can be 
satisfied by selecting A„ and Wn,kj appropriately. For example, we may let A„ = ^"^ogn ^ 
and Wn,kj = I A;/ where is the un-penalized likelihood estimate of which is 

root-(n/P„) consistent. Then wc have a„ = Op{l) and ^ = Op{^). Hence \n\/o^ — 
Op((nP„)~^^^) and — Op{n) are satisfied when ^ ^ 0. 

5.2 Likelihood Ratio Test 

Similarly as in Fan and Peng (2004), we develop a hkelihood ratio test for testing linear 
hypotheses: 

i/o:AX,^^ = vs. i/i : AX,^^ 7^ 0, 

where An is a x \An\ matrix and — > Iq for a fixed q. This problem includes testing 
simultaneously the significance of several covariate variables. 
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We introduce a natural likelihood ratio test statistic, i.e. 

Tn = 2\snpQn{f3jV)- sup Qn{f3jV)\ , 

where fin is the parameter space for /3„. Then we can obtain the following theorem regarding 
the asymptotic null distribution of the test statistic. 

Theorem 5 When conditions in (b) of Theorem [7] are satisfied, under Hq we have 

Tn xl, as n ^- cx). 

6 Simulation Study 

In this section, we use simulations to demonstrate the hierarchical lasso (HLasso) method, 
and compare the results with those of some existing methods. 

Specifically, we first compare hierarchical lasso with some other group variable selection 
methods, i.e., the L2-norm group lasso (jl]) and the Loo-norm group lasso ([5]). Then we 
compare the adaptive hierarchical lasso with some other "oracle" (but non-group variable 
selection) methods, i.e., the SCAD and the adaptive lasso. 

We extended the simulations in Yuan and Lin (2006). We considered a model which 
had both categorical and continuous prediction variables. We first generated seventeen 
independent standard normal variables, Zi, . . . , Zig and W. The covariates were then defined 
as Xj = {Zj + W)/\/2. Then the last eight covariates Xg, . . . ,Xie were discretized to 0, 1, 
2, and 3 by <^-\l/A), <l>~\l/2) and $-^3/4). Each of Xi, ... ,^8 was expanded through a 
fourth-order polynomial, and only the main effects of Xg, . . . , Xiq were considered. This gave 
us a total of eight continuous groups with four variables in each group and eight categorical 
groups with three variables in each group. We considered two cases. 

Case 1. "All-in-all-out" 

Y = [Xs + 0.5X| + O.lXf + 0.1X3^] + [Xg - 0.5X| + 0.15X| + O.IX^] 
+ [I(X9 = 0) + I(X9 = 1) + I(X9 = 2)] + e. 
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Case 2. "Not all-in-all-out" 

Y={X, + X|) + (2X6 - 1.5X|) + [I{Xg = 0) + 2 I{X, = 1)] + e. 

For all the simulations above, the error term e follows a normal distribution N(0,cr^), 
where was set such that each of the signal to noise ratios, Var(X^/3)/Var(e), was equal to 
3. We generated n = 400 training observations from each of the above models, along with 200 
validation observations and 10,000 test observations. The validation set was used to select 
the tuning parameters A's that minimized the validation error. Using these selected A's, we 
calculated the mean squared error (MSE) with the test set. We repeated this 200 times 
and computed the average MSEs and their corresponding standard errors. We also recorded 
how frequently the important variables were selected and how frequently the unimportant 
variables were removed. The results are summarized in Table [1] 

As we can see, all shrinkage methods perform much better than OLS; this illustrates that 
some regularization is crucial for prediction accuracy. In terms of prediction accuracy, we 
can also see that when variables in a group follow the "all-in-all-out" pattern, the L2-norm 
(group lasso) method performs slightly better than the hierarchical lasso method (Case 1 
of Tabled]). When variables in a group do not follow the "all-in-all-out" pattern, however, 
the hierarchical lasso method performs slightly better than the L2-norm method (Case 2 of 
Tabled]). For variable selection, we can see that in terms of identifying important variables, 
the four shrinkage methods, the lasso, the Loo-norm, the L2-norm, and the hierarchical lasso 
all perform similarly ("Non-zero Var." of Tabled])- However, the L2-norm method and the 
hierarchical lasso method are more effective at removing unimportant variables than lasso 
and the Loo-norm method ("Zero Var." of Tabled])- 

In the above analysis, we used the criterion (|H]) or fllip for the hierarchical lasso, i.e., we 
did not use the adaptive weights Wkj to penalize different coefficients differently. To assess 
the improved version of the hierarchical lasso, i.e. criterion f|T^ . we also considered using 
adaptive weights. Specifically, we applied the OLS weights in f[T^ to f lH)) with 7 = 1. We 
compared the results with those of SCAD and the adaptive lasso, which also enjoy the oracle 
property. However, we note that SCAD and the adaptive lasso do not take advantage of 
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Table 1: Comparison of several group variable selection methods, including the L2-norm 
group lasso, the Loo-norm group lasso and the hierarchical lasso. The OLS and the regular 
lasso are used as benchmarks. The upper part is for Case 1, and the lower part is for Case 2. 
"MSE" is the mean squared error on the test set. "Zero Var." is the percentage of correctly 
removed unimportant variables. "Non-zero Var." is the percentage of correctly identified 
important variables. All the numbers outside parentheses are means over 200 repetitions, 
and the numbers in the parentheses are the corresponding standard errors. 



Case 1: "All-in-all-out" 





OLS 


Lasso 


Loo 


L2 


HLasso 


MSE 


0.92 (0.018) 


0.47 (0.011) 


0.31 (0.008) 


0.18 (0.009) 


0.24 (0.008) 


Zero Var. 




57% (1.6%) 


29% (1.4%) 


96% (0.8%) 


94% (0.7%) 


Non-Zero Var. 




92% (0.6%) 


100% (0%) 


100% (0%) 


98% (0.3%) 


Case 2: "Not all-in-all-out" 




OLS 


Lasso 


Loo 


L2 


HLasso 


MSE 


0.91 (0.018) 


0.26 (0.008) 


0.46 (0.012) 


0.21 (0.01) 


0.15 (0.006) 


Zero Var. 




70% (1%) 


17% (1.2%) 


87% (0.8%) 


91% (0.5%) 


Non-zero Var. 




99% (0.3%) 


100% (0%) 


100% (0.2%) 


100% (0.1%) 



the grouping structure information. As a benchmark, we also computed the Oracle OLS 
results, i.e., OLS using only the important variables. The results are summarized in Table 
[2J We can see that in the "all-in-all-out" case, the adaptive hierarchical lasso removes 
unimportant variables more effectively than SCAD and adaptive lasso, and consequently, 
the adaptive hierarchical lasso outperforms SCAD and adaptive lasso by a significant margin 
in terms of prediction accuracy. In the "not all-in-all-out" case, the advantage of knowing 
the grouping structure information is reduced, however, the adaptive hierarchical lasso still 
performs slightly better than SCAD and adaptive lasso, especially in terms of removing 
unimportant variables. 
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To assess how the sample size affects different "oracle" methods, we also considered 
n=200, 400, 800, 1600 and 3200. The results are summarized in Figure [H where the first 
row corresponds to the "all-in-all-out" case, and the second row corresponds to the "not all- 
in-all-out" case. Not surprisingly, as the sample size increases, the performances of different 
methods all improve: in terms of prediction accuracy, the MSB's all decrease (at about 
the same rate) and get closer to that of the Oracle OLS; in terms of variable selection, the 
probabilities of identifying the correct model all increase and approach one. However, overall, 
the adaptive hierarchical lasso always performs the best among the three "oracle" methods, 
and the gap is especially prominent in terms of removing unimportant variables when the 
sample size is moderate. 

Table 2: Comparison of several "oracle" methods, including the adaptive hierarchical lasso, 
SCAD and the adaptive lasso. SCAD and adaptive lasso do not take advantage of the 
grouping structure information. The Oracle OLS uses only important variables. Descriptions 
for the rows are the same as those in the caption of Table [H 



Case 1: "All-in-all-out" 





Oracle OLS 


Ada Lasso 


SCAD 


Ada HLasso 


MSE 


0.16 (0.006) 


0.37 (0.011) 


0.35 (0.011) 


0.24 (0.009) 


Zero Var. 




77% (0.7%) 


79% (1.1%) 


98% (0.3%) 


Non-Zero Var. 




94% (0.5%) 


91% (0.6%) 


96% (0.5%) 


Case 2: "Not all-in-all-out" 




Oracle OLS 


Ada Lasso 


SCAD 


Ada HLasso 


MSE 


0.07 (0.003) 


0.13 (0.005) 


0.11 (0.004) 


0.10 (0.005) 


Zero Var. 




91% (0.3%) 


91% (0.4%) 


98% (0.1%) 


Non-zero Var. 




98% (0.4%) 


99% (0.3%) 


99% (0.3%) 
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Figure 1: Comparison of several oracle methods, including the SCAD, the adaptive lasso 
and the adaptive hierarchical lasso. SCAD and adaptive lasso do not take advantage of the 
grouping structure information. The Oracle OLS uses only important variables. The first row 
corresponds to the "all-in-all-out" case, and the second row corresponds to the "not all-in- 
all-out" case. "Correct zero ratio" records the percentage of correctly removed unimportant 
variables. "Correct non-zero ratio" records the percentage of correctly identified important 
variables. 
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7 Real Data Analysis 



In this section, we use a gene expression dataset from the NCI-60 collection of cancer cell 
lines to further illustrate the hierarchical lasso method. We sought to use this dataset to 
identify targets of the transcription factor p53, which regulates gene expression in response 
to various signals of cellular stress. The mutational status of the p53 gene has been reported 
for 50 of the NCI-60 cell lines, with 17 being classified as normal and 33 as carrying mutations 
(Olivier et al. 2002). 

Instead of single-gene analysis, gene-set information has recently been used to analyze 
gene expression data. For example, Subramanian et al. (2005) developed the Gene Set 
Enrichment Analysis (GSEA), which is found to be more stable and more powerful than 
single-gene analysis. Efron and Tibshirani (2007) improved the GSEA method by using new 
statistics for summarizing gene-sets. Both methods are based on hypothesis testing. In this 
analysis, we consider using the hierarchical lasso method for gene-set selection. The gene- 
sets used here are the cytogenetic gene-sets and the functionals gene-sets from the GSEA 
package (Subramanian et al. 2005). We considered 391 overlapping gene-sets with the size 
of each set greater than 15. 

Since the response here is binary (normal vs mutation), following the result in Section 



15.11 we use the logistic hierarchical lasso regression, instead of the least square hierarchical 

lasso. Note that a gene may belong to multiple gene-sets, we thus extend the hierarchical 

lasso to the case of overlapping groups. Suppose there are K groups and J variables. Let Qk 

denote the set of indices of the variables in the fcth group. One way to model the overlapping 

situation is to extend the criterion (|H]) as the following: 

n / K \ 
max ^^\^<^k ^ (17) 

K J 

k=i j=i 
subject to (ifc > 0, = 1, . . . , i^, 

where aj can be considered as the "intrinsic" effect of a variable (no matter which group 
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it belongs to), and different group effects are represented via different dk- In this section, 
£{rii, Hi) = i/jr^j — log(l+e''*) is the logistic log-likelihood function with yi being a 0/1 response. 
Also notice that if each variable belongs to only one group, the model reduces to the non- 
overlapping criterion ([8]). 

We randomly split the 50 samples into the training and test sets 100 times; for each split, 
33 samples (22 carrying mutations and 11 being normal) were used for training and the rest 
17 samples (11 carrying mutations and 6 being normal) were for testing. For each split, 
we applied three methods, the logistic lasso, the logistic L2-norm group lasso (Meier, van 
der Geer and Buhlmann 2008) and the logistic hierarchical lasso. Tuning parameters were 
chosen using five-fold cross-validation. 

We first compare the prediction accuracy of the three methods. Over the 100 random 
splits, the logistic hierarchical lasso has an average misclassification rate of 14% with the 
standard error 1.8%, which is smaller than 23%(1.7%) of the logistic lasso and 32%(1.2%) of 
the logistic group lasso. To assess the stability of the prediction, we recorded the frequency 
in which each sample, as a test observation, was correctly classified. For example, if a sample 
appeared in 40 test sets among the 100 random splits, and out of the 40 predictions, the 
sample was correctly classified 36 times, we recorded 36/40 for this sample. The results are 
shown in Figure |2l As we can see, for most samples, the logistic hierarchical lasso classified 
them correctly for most of the random splits, and the predictions seemed to be slightly more 
stable than the logistic lasso and the logistic L2-norm group lasso. 

Next, we compare gene-set selection of these three methods. The most notable difference 
is that both logistic lasso and the logistic hierarchical lasso selected gene CDKNIA most 
frequently out of the 100 random split, while the logistic L2-norm group lasso rarely selected 
it. CDKNIA is also named as wild-type p53 activated fragment-1 (p21), and it is known 
that the expression of gene CDKNIA is tightly controlled by the tumor suppressor protein 
p53, through which this protein mediates the p53-dependent cell cycle Gl phase arrest in 
response to a variety of stress stimuli (Loh, Moritz, Contente and Dobbelstein 2003). 

We also compared the gene-sets selected by the logistic hierarchical lasso with those 
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selected by the GSEA of Subramanian et al. (2005) and the GSA of Efron and Tibshirani 
(2007). The two most frequently selected gene-sets by the hierarchical lasso are atm pathway 
and radiation sensitivity. The most frequently selected genes in atm pathway by the logistic 
hierarchical lasso are CDKNIA, MDM2 and RELA, and the most frequently selected genes 
in radiation sensitivity are CDKNIA, MDM2 and BCL2. It is known that MDM2, the 
second commonly selected gene, is a target gene of the transcription factor tumor protein 
p53, and the encoded protein in MDM2 is a nuclear phosphoprotein that binds and inhibits 
transactivation by tumor protein p53, as part of an autoregulatory negative feedback loop 
(Kubbutat, Jones and Vousden 1997, Moll and Petrenko 2003). Note that the gene-set 
radiation sensitivity was also selected by GSEA and GSA. Though the gene-set atm pathway 
was not selected by GSEA and GSA, it shares 7, 8, 6, and 3 genes with gene-sets radiation 
sensitivity, p53 signalling, p53 hypoxia pathway and p53 Up respectively, which were all 
selected by GSEA and GSA. We also note that GSEA and GSA are based on the marginal 
strength of each gene-set, while the logistic hierarchical lasso fits an "additive" model and 
uses the joint strengths of gene-sets. 

8 Discussion 

In this paper, we have proposed a hierarchical lasso method for group variable selection. 
Different variable selection methods have their own advantages in different scenarios. The 
hierarchical lasso method not only effectively removes unimportant groups, but also keeps 
the flexibility of selecting variables within a group. We show that the improved hierarchical 
lasso method enjoys an oracle property, i.e., it performs as well as if the true sub-model were 
given in advance. Numerical results indicate that our method works well, especially when 
variables in a group are associated with the response in a "not all-in-all-out" fashion. 

The grouping idea is also apphcable to other regression and classification settings, for 
example, the multi-response regression and multi-class classification problems. In these 
problems, a grouping structure may not exist among the prediction variables, but instead, 
natural grouping structures exist among parameters. We use the multi-response regression 
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Figure 2: The number of samples vs the frequency that a sample was correctly classified on 
100 random splits of the p53 data. 



problem to illustrate the point (Breiman and Friedman 1997, Turlach, Venables and Wright 
2005). Suppose we observe {xi, Ui), . . ., {Xn, t/„), where each = {yn, . . . , yix) is a vector 
containing K responses, and we are interested in selecting a subset of the prediction variables 
that predict well for all of the multiple responses. Standard techniques estimate K prediction 
functions, one for each of the K responses, fk{x) — ^ki^i + • • • + PkpXp, k — 1, . . . ,K. 
The prediction variables {xi, . . . , Xp) may not have a grouping structure, however, we may 
consider the coefficients corresponding to the same prediction variable form a natural group, 
i.e., (32j, ■ ■ ■ , Pxj)- Using our hierarchical lasso idea, we reparameterize (3kj — dja^j, 
dj > 0, and we consider 



K n 



fe=l i=\ \ i=l 



2 



p V K 

-Ai ■^d^-\2-Y^^ \akj\. 
j=i j=i k=i 

Note that if dj is shrunk to zero, all (3kj, k = 1, . . . ,K will be equal to zero, hence the jth 
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prediction variable will be removed from all K predictions. If dj is not equal to zero, then 
some of the a^j and hence some of the /3kj, k = 1, . . . ,K, still have the possibility of being 
zero. Therefore, the jth variable may be predictive for some responses but non-predictive 
for others. 

One referee pointed out the work by Huang, Ma, Xie and Zhang (2009), which we were not 
aware of when our manuscript was first completed and submitted in 2007. We acknowledge 
that the work by Huang, Ma, Xie and Zhang (2009) is closely related with ours, but there 
are also differences. For example: 

• We proved the oracle property for both group selection and within group selection, 
while Huang, Ma, Xie and Zhang (2009) considered the oracle property only for group 
selection. 

• Our theory applies to the generalized maximum likelihood estimate, while Huang, Ma, 
Xie and Zhang (2009) considered the penalized least squares estimate. 

• Handling overlapping groups. It is not unusual for a variable to be a member of several 
groups. The gene expression date we analyzed in Section [7] is such an example: given 
a plethora of biologically defined gene-sets, not surprisingly, there will be considerable 
overlap among these sets. 

In Huang, Ma, Xie and Zhang (2009), a prediction variable that appears in more than 
one group gets penalized more heavily than variables appearing in only one group. 
Therefore, a prediction variable belonging to multiple groups is more likely to be re- 
moved than a variable belonging to only one group. We are not sure whether this is an 
appealing property. In our approach, as shown in f[T7|) . if a prediction variable belongs 
to multiple groups, it does not get penalized more heavily than other variables that 
belong to only one group. 
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Appendix 
Proof of Lemma 1 

Let Q*{Xi, A2, d, a.) be the criterion that we would like to maximize in equation (7) and let 
Q*{\,d,a) be the corresponding criterion in equation (8). 

Let {d ,dc*) be a local maximizer of Q*{Xi, X2, d, a). We would like to prove {d — 
Xid , a* — a*/Xi) is a local maximizer of Q*(X, d, a). 

We immediately have 

g*(Ai, A2, d, a) - Q*{X, Aid, a/Ai). 
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Since (d , a*) is a local maximizer of Q*(Xi, X2, d, a), there exists 5 > such that if d', ex! 
satisfy \\d! - + \\ol' - a\\ < 5 then Q*{\i, X2, d' , a') < Q*{\i, X2, d , a.*). 

Choose 6' such that — /' < 6, for any {d", oc") satisfying \\d" — d || + ||a" — Q:*|| < S' 

min(^Ai,^j 



we have 



d" ^* 



+ ||Aia"-d*|| < 



Ai 


d" 

Ai 


-d* 


+ f ||Aia"-a*|| 

Ai II II 






min 





< 



\d" -d\\ + - a* 
min (Ai, ^) 
5' 



min (Ai,^) 



< 6. 



Hence 

Q*(X,d,a") = g*(Ai,A2,d7Ai,Aia") 
< g*(Ai,A2,d 
= Q*{X,d\a*). 

Therefore, {d — Aid , a* = a.* /Xi) is a local maximizer of Q*{X^ d, ex.). 

Similarly we can prove that for any local maximizer (d ,6c*) of Q*{X,d,oc), there is a 
corresponding local maximizer (d , a*) of Q*(Xi, A2, d, a) such that dlalj — dl&lj. 

Lemma 2 Suppose (d,6c) is a local maximizer of (8). Let be the Hierarchical Lasso 

estimate related to (d,a.), i.e., f3kj = dkCtkj- If dk — 0, then cx^ — 0; if dk 7^ 0, then 

K 



anddk= \^ X\\(3,^\\i, dck 



Vmili' 



Proof of Lemma 2 

If dk = 0, then = is quite obvious. Similarly, if dtk = 0, then dk = 0. Therefore, if 
dk + 0, then OLk^^ and \hkh 7^ 0- 
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We prove dk = y Q.k = , for dj. 7^ by contradiction. Suppose 3A;' such 

that dk> ^ and 4' 7^ \/A||y3.,||i. Let ^^fr"' = c. Then = c . Suppose c> 1. 

Let 4 = dk and dtk — otk ior k k' and 4/ = 5' 4' and 0:^/ = ock'jr, where 5' satisfies 
c> 5' > 1 and is very close to 1 such that ||4' ~ 4'||i + llcKfe' — CKfe'Hi < 5 for some 5 > 0. 

Then we have 

Q*{X,d,dt) - Q*{X,d,a) = -S'\dk'\ - ^\\\ctk'\\i + \dk'\ + \\\dck'\\i 

6' c 1 



> 0. 

Therefore, for any 5 > 0, we can find d, ct such that d||i+||Q:— q:||i < S and Q*{X, d, a) > 
Q*{X, d, a). These contradict with (d, a) being a local maximizer. 

Similarly for the case when c < 1. Hence, we have the result that if dk 7^ 0, then 



4 = \/A||^fc||i, OLk 



Proof of Theorem 1 

Let Q{X,(3) be the corresponding criterion in equation (11). 

Suppose (d, a) is a local maximizer of (5*(A,<i, a), we first show that yS, where Pkj — 
dk&kj, is a local maximizer of Q{X,(3), i.e. there exists a 5' such that if ||A/3||i < 5' then 
Q(A,;9 + A/3) < g(A,^). 

We denote A/3 = + A/3(^\ where A/3i^^ = if pJli = and A^[^^^ = if 

ll^fclli ^ 0. We have ||A/3||i = ||A/3«||i + ||A/3(2)||i. 

Now we show Q(X,^ + A/3(^)) < Q{X,^) if 5' is small enough. By Lemma 2, we have 
dk — y A||^j^.||i, CKfc = , ^'1 if ||4||i 7^ and ckfc = if ||4||i = 0. Furthermore, let 

^ V ^ll^felli 

d'k = V^APfe + A/3«||i, a; = . "^l^),, if 7^ 0. Let < = 0, a', = if = 0. 

Then we have (5*(A, d , a') = Q(A, y9+A/3^^^) and Q*{X, d, a) = Q{X, Hence we only need 
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to show that Q*{X, ^ , a') < Q*{X, d, a). Note that {d, a) ia a local maximizer of Q*{X, d, a). 
Therefore there exists a S such that for any d' , ol' satisfying \\d! — + ||a' — q:||i < 5, we 
have g*(A, d', a') < Q*(A, d, a). 
Now since 



14 - dk 



A||^, + A/3«||i- VAII/3 



< 



< 



k\\i 



k 111 



mk\ 
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k 111 
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2 VAa - \5' 
lA||A/3^') 



fc 111 



2 ' 
where a = min{||^j;.||i : 7^ 0} and 5' < a/2. 



Furthermore 
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X0, + A(3^^\ 
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where6 = max{||^,||i:||^,||i^0}. 

Therefore, there exists a small enough 5', if ||A^(^)||i < 5' we have \\d -d\\ 1 + ||q; — Q^lli < 
5. Hence Q*{\,d ,dt ) < Q*{X,d,cx) (due to local maximality) and Q{X,^ + A(3^^^) < 

Next we show Q(X, ^ + + A(3^^^) < Q{X, ^ + A(3^^^). Note that 

K I 

g(A,;3 + A/3W + A/3(2)) - Q(A,^ + A/3(i)) = A/3(2)Vl(^*) - ^ A|| A/3(2)||i, 

k=l 

where /3* is a vector between + A/3(^) + A/3^^) and ^3 + A/?^^^ Since ||A/3(2)||i < 5' is small 
enough, the second term dominates the first term, hence we have (5(A,y9 + A/3(^) + A/3(^)) < 
g(A,^ + A/3«). 

Overall, we have that there exists a small enough 5', if || A/3||i < 5', then Q{X, ^ + A/3) < 
Q{X,^), which implies that ;9 is a local maximizer of Q{X,^). 

Similarly, we can prove that if /3 is a local maximizer of (5(A, /3), and if we let dk — 
J A||^;^||i, ock = „ for ll^fclli ^ and let 4 = 0, = for 0h\\i = 0> "then (d, a) is 

a local maximizer of Q*{X, d, a). 

Regularity Conditions 

Let Sn be the number of non-zero groups, i.e., ||/3°fc|| 7^ 0. Without loss of generality, we 
assume 

\\(3l,\\ ^ 0, forA; = l,...,5„, 
||/3°,|| = 0, for = + 

Let Snk be the number of non-zero coefficients ; again, without loss of 

generality, we assume 

^ 0, for /C = 1, . . . , 5^; J = 1, . . . , Snk, 

^n,k3 = 0, for A; = 1, . . .,5'„; j = + 1, ■ ■ ■ ,Pnfe- 
For simplicity, we write /3n,fej, Pnfe and Snk as and in the following. 
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Since we have diverging number of parameters, to keep the uniform properties of the 
hkehhood function, we need some conditions on the higher-order moment of the hkehhood 

function, as compared to the usual condition in the asymptotic theory of the hkehhood 
estimate under finite parameters (Lehmann and CaseUa 1998). 

(Al) For every n, the observations {Vni,i — 1,2, are independent and identically 

distributed, each with a density /„(V„i,/3„). /„(V„i,/3„) has a common support and 
the model is identifiable. Furthermore, the first and second logarithmic derivatives of 
/„ satisfy the equations 

-dlogMVnl,(3j 



0, for A; - 1 

d 



,pk 



log/„(F„i,/3„) 



d 



dp, 



k2h 



log/„(F„i,/3j 



= E 



'/3n 



l0gfn{Vnl,f3j 



(A2) The Fisher information matrix 



satisfies the condition 



E 



y3„ 



d 9" 

^ log fn{Vnl, /3n)^ log /n(V"nl, /3„) 



< Ci < A„i„{/(/3„)} < A„a.{/(/3J} < ^2 < oo. 



and for any ki,ji, k2,j2, we have 
d 



Er. 



df^kiji 



log/n(V„i,/3, 



d 



E 



dl3k2j2 
Q2 



iog/„(y,i,/3j 

log/n(Vnl,/3n) 



< Cs < OO, 



< C4 < OO. 



(A3) There exists an open subset a;„ of Jl^ e i?^" that contains the true parameter point 
/3° such that for almost all Vni, the density /„(V„i,/3„) admits all third derivatives 
fn{V ni, l3n) / {d/3kj^jj^d/3k2j2d/3k3j3) for all G cun- Furthermore, there exist functions 
Mnkijik2j2k3j3 such that 

" < Mnkihk2j2k3j3i^ni) for aU (3^ e 



f)R f)R F)R fog/"(^nl'/^n) 
(^Pfciji 'JPk232 '^Pksjs 

and E^jMX.^,^.^,^.^{Vni)] < C, < 00. 
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These regularity conditions guarantee the asymptotic normahty of the ordinary maximum 
hkehhood estimates for diverging number of parameters. 

For expositional simphcity, we will first prove Theorem 3 and Theorem 4, then prove 
Theorem 2. 

Proof of Theorem 3 

We will show that for any given e > 0, there exists a constant C such that 

Pr |^^sup^Q„(/3° + a^u) < gn(/3°)| > 1 - e, (18) 

where = ^/RnirT^I'^ + \ny/a^/2^/ci). This implies that with probability at least 1 — e, 
that there exists a local maximum in the ball {/3^ + q;„w : ||w|| < C}. Hence, there exists 
a local maximizer such that — = Op{an)- Since 1/2^/ci is a constant, we have 
\\K - /3nll = 0,(v/n:(n-i/2 + K^)). 
Using Pa„,'u;„(0) = 0, we have 

Dn{u) = g„(/3°+a„ii)-g„(/3°) 

Sn 

k=l 

= + (19) 
Using the standard argument on the Taylor expansion of the likelihood function, we have 
(/) = a,,u'VLM) + lu-'V'LM)ual + ^u''V{u-'V'L^{/3*Ju}al 

4 /, + /2 + /3, (20) 

where /3* lies between /3° and f3^ + anU. Using the same argument as in the proof of Theorem 
1 of Fan and Peng (2004), we have 



|7i| = 0,{aln)\\ul (21) 

2 

71 CM 

h = -^tiV„(/3°)n + o,(l)na^||n|p, (22) 
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and 



^ Kn Pk Kn Pk Kn Pk 

a 



fci=l ii=l fc2=l i2=l A:3=l j3= 



df3kijidf3k2j2df3k3j: 



< 



^ n ( Kn Pk Kn Pk Kn Pk ^ 
i=l l,A:i=l ji=l A:2=l J2=l A:3=l J3=l ) 



1/2 



Op(P5^an)na^||n| 



Since — — )■ and P^Xn^fa^ — )■ as n — )■ oo, we have 



I/3I = o„(na:)||u 



(23) 



From fl2Tl) - fl23l) . we can see that, by choosing a sufficiently large C, the first term in I2 
dominates Ji uniformly on \\u\\ = C; when n is large enough, I2 also dominates 1^ uniformly 
on ||u|| = C. 

Now we consider (//). Since a„ = y/P^^n^^^"^ + \ny/a^/2y/ci) — )■ 0, for \\u\\ < C we have 



\/3t + anUkjl > - \anUkj\ > 



(24) 



for n large enough and 7^ 0. Hence, we have 



> An('l/w„,fcl|/3°i| + . . . + Wn,ksk\(^ksk\ ~ (^n{Wn,kl\Ukl\ + • • • + Wn,fcsjMfcsJ) 



' Wn,ki\l^li\ + ■ ■ ■ + Wn,ksk\l^lsk\) (for u large cuough, by ([2 

= ^n^J Wn,kl\(3l^\ + ■ ■ ■ + Wn,ksk\Plsk\W^ - Ink - I). 

where 7„fc is defaned as 7„fc = igo , , , ^ |go i ■ For n large enough, we have < 



7„fc < 1 and 'jnk < 



as n — > oo. 



an||-"fc||(«'„,fei+---+w„^fcsj.) _ a„||ufe|| ^ a„C 



Cl(w„,fel + ...+1i),i,fcsj.) 



< with probability tending to 1 
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Therefore, 



> >^n^Wn,kl\l^tl\ + ■ ■ ■ + '^n,ks,\l3Ui^l - Ink - I) 



> -A. 



(Using 7„fc = Op(l) and Taylor expansion) 

Oin{Wn,kl\Ukl\ + • • • + Wn,ks^UksJ\) f 1+ \Op{l] 



^JwnM\Pkl\ + ■■■+Wn,ksMsJ 



. . \\Uk\\^/a^ 

> -arXi 7ri= (l + |Op(l)l)- 

z^j ci 

Therefore, the term (//) in f[T^ is bounded by 



which is further bounded by 



nan\nVa^{\\u\\ ■ 7r7=)(l + bp(l)l)- 
Zyf ci 

Note that an = \fP^{n~^l'^ + A„i/a^/2y^), hence the above expression is bounded by 

\\u\\nal{\\\op{V)\). 

This term is also dominated by the first term of li on ||w|| = C uniformly. Therefore, 
Dn{u) < is satisfied uniformly on ||n|| = C. This completes the proof of the theorem. 

Proof of Theorem 4 

We have proved that if A„y^ = Op(n^^/^), there exists a root-(?2/P„) consistent estimate 
/3„. Now we prove that this root-(?2/Pn) consistent estimate has the oracle sparsity under 
the condition -^j- = Op{n), i.e., (3kj = with probability tending to 1 if /3^j = 0. 
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Using Taylor's expansion, we have 



dPkj d/3kj d/3kj 

" ^^^^ tih^^'^^^'^^^ 

^ K„ Pki Kn P'=2 (fi Ij 



n-^n'^n,kj 



'^^/Wn,kl\Pkl\ + ■■■ + Wn,kpk\Pkpk\ 
^ + + + 



sgn(/3fci) 



(25) 



where y9* hes between and 

Using the argument in the proof of Lemma 5 of Fan and Peng (2004), for any (3^ satisfying 

WPn - /3nll = Opiy^Pjn), wB have 



h = Op(^^) = Op(v^), 

h = Op(y^), 

-^3 = Op{^/nP^). 



Then, since /9„ is a root-(n/Pn) consistent estimate maximizing Qn{/3n)-> if ^kj 7^ 0, we 
have 



= 0. 



sgn(4j) 



(26) 



Therefore, 



WnM\Pkl\ + • • • + Wn^kpk\l^kpk\ 

This can be extended to 



Op(^^^) for ^ 0. 



n\nWn,kjWkj\ 



Wn,fcl|A 



kl\ 



'^n,kpk l/^fcpfe 



\Pkj\Op{^/nPn), 
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for any Pj^j with f3j^f^ 7^ 0. If we sum this over all j in the kth group, we have 

. Pk 

71/ 



^nV Wn,kl\l^kl\ + . . . + |4pJ = El/^'^^-|^p(v^)- (27) 



Since is a root-(r2/P„) consistent estimate of /3°, we have \f3kj\ = Op{l) for {k,j) G An 



and \Pkj\ = Op{^/Pn/n) for (kj) G i3„ U C„. 

Now for any k and j satisfying = and 7^ 0, equation fl26|) can be written as: 



(28) 



/3n=^n 2A„iyW„,fcl|/3fel| + • • • + Wn,kpk\^kpk\ 

{Op{^/ Pnln)n\n\l Wn,kSkl\ + ■■■ + \f^kpk\ 

-nXlwn,kjSgn0kj)) 
= 0. 

Denote h^k = Opi^J Pn/n)nXn\/wn,ki\Pki\ + ■■■ +Wn,kpJPkpJ- Let /i„ = X]f=i ^nfc- By equa- 



tion (127D, we have hn = Ef="i Op( /fV^) ^ -=1 |/3fe.|0p(v^) = Op{P^). Since 



Op(n) guarantees that nX^bn dominates hn with probability tending to 1 as n — )■ 00, the first 
term in fl28p is dominated by the second term as n — i- 00 uniformly for all k and j satisfying 

= since Wn,kj > K and K > Kk- Denote g^k = Wn,ki\hi\ + • • • + Wn,kpk\hpk\/ {n\lh 
Let Qn = E£i ^?nfc. By equation (EZD, we have = 2 E£i(1/^) E^ii 0kj\Op{V^)/{nXlb„) 
Op{l / y/nPn) . The absolute value of the second term in fl28|) is bounded below by l/gn- So 
with probability uniformly converging to 1 the second term in the derivative ^^^|^^^ will 
go to 00 as n — !■ 00, which is a contradiction with equation (!28|) . Therefore, for any k and 
j satisfying I3^j = 0, we have (S^j = with a probability tending to 1 as n — )■ 00. We have 
f^Vn = with probability tending to 1 as well. 

Now we prove the second part of Theorem 4. From the above proof, we know that 
there exists (/3„_4^, 0) with probability tending to 1, which is a root-(?T,/P„) consistent local 
maximizer of Q{l3^). With a slight abuse of notation, let Qn{(3n,A„) ~ Qnif^n^An^ Using 
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the Taylor expansion on V(5n(^n,^„) point /3^^^^, we have 

where /3; _4^ hes between ^„__4^ and /3°^^^. 
Now we define 



(29) 



n 



-j^ n S„ Pk S„ Pk S„ Pk 

- :;;2 X] M\^n,An ~ (^n,An 11"^ X] X] X] X] X] X] ^Ik^hk2hk332 O^ni) 



Using the Cauchy-Schwarz inequahty, we have 

2 n Sn Pk S„ Pk Sn Pk 

n 

Since — > as n — > oo, by Lemma 8 of Fan and Peng (2004), we have 

1, 



j=l fcl=l jl = l fc2 = l j2 = l fc3 = li3 = l 

2 /^2\r^ I d3\ _ /n / d5 



and 



Since 



-^'lm,a:)^u^i.a:) 



Op{l/Pn) 



-V^L„(/3°_^^) + /„(/3°^^^)^ 0n,An ~ l^n,A„] 



(30) 



Op(l/^^i^) = Op(l/V^). (31) 



= V^„,fei|/3,M(i + Op(v^)) + . . . + ^n,fe.J/3LJ(i + Op(a/^)) 



we have 



kj 



:(l+Op(v/^)). 
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Furthermore, since 



for {k,j) G An, we have 
1 



< ^^^^ < = o^iinPn)-'^') 



n 



VJn{f3. 



n.An 



2\ Wn,kl\Pkl \ + ■ ■ ■ + Wn,ksu\ Pks^ I 



Op{{nPn 



-l/2x 



and 



n 



(32) 



Together with (jSOD, ([M]) and (EID, from ([29]) we have 

Now using the same argument as in the proof of Theorem 2 of Fan and Peng (2004), we 
have 

where A„ is a g x \ An\ matrix such that — G and G is a g x g nonnegative symmetric 

matrix. 



Proof of Theorem 2 



Note that when w 



n,kj 



1, we have a„ = 1 and 6„ = 1. The conditions An^/o^ = 0„(?2 ^/^) 



and -j^j- = Op{n) in Theorem 4 become \n\/n = Op{l) and — > 0. These two conditions 
cannot be satisfied simultaneously by adjusting A„, which implies that Pr(/3j, = 0) — )■ 1 
cannot be guaranteed. 

We will prove that by choosing A^ satisfying y/n\n = Op{l) and Pnn~^^^ / \n — )■ as 

n — )■ oo, we can have a root-n consistent local maximizer = f3^^, ^(^^) such that 
Pr(^c„ = 0) ^ 1. 

Similar as in the proof of Theorem 4, we let /i^ = ^nk- By equation ( 1271) . we 



have < = Ef:5„+i Op(v/iV^) E •=! I4i|0,(v^) = Op(i'„Vy^). Since P„n-3/VA„ ^ 
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guarantees that nA^ dominates with probabihty tending to 1 as n — )■ oo, the first term 
in f l28p is dominated by the second term as n — oo uniformly for any k satisfying (3^^^^ = 
since Wn,kj = 1 and /;,'„ > hnk- Similar as in the proof of Theorem 4, we have /3(j^ = with 
probability tending to 1. 

Proof of Theorem 5 

Let Nn = \An\ be the number of nonzero parameters. Let Bn be an (A^„ — x Nn matrix 
which satisfies BnB^ = lN„-q and AnBj^ = 0. As /3„_4^ is in the orthogonal complement to 
the linear space that is spanned by the rows of A„ under the null hypothesis Ho, it follows 
that 

where 7„ is an {Nn — q) x 1 vector. Then, under Hq the penalized likelihood estimator is 
also the local maximizer 7„ of the problem 

Qn{f3n,Aj = maxQ„(S;7„). 
To prove Theorem 5 we need the following two lemmas. 

Lemma 3 Under condition (6) of Theorem 4 and the null hypothesis Hq, we have 

Bli% - ll) = -Bl{BMf3%jBl}-'BnWLni^%J + o.in'"). 
Proof of of Lemma 3 

We need only prove the second equation. The first equation can be shown in the same 
manner. Following the proof of Theorem 4, it follows that under Hq, 

BrMf^lAjKiin - ll) = ^BnVLnif3l^J + Op(n-i/2). 
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As the eigenvalue Aj(S„J„(/3°_^^)S^) is uniformly bounded away from and infinity, we 
have 

I b 

Lemma 4 Under condition {h) of Theorem 4 and the null hypothesis Hq, we have 

Qn0n,AJ - Qn{Bl7n) (33) 
= l^{Pn,A. - Blln) UPUjiPnAu ' Bl%) + 0,(1). 

Proof of Lemma 4 

A Taylor's expansion of Qn0n,An) ~ QniBlin) at the point yields 
where 

T2 = -l0n,Ar. - BllS^^LX^n,Au){KA^ - Bl%), 



Ta = \0r.A.-BllSV^W,Aj{KAu-Blln)- 



6 
1 

2 

We have Ti = as V^Qn{K,An) = 0- 

Let e„ = In{0i,An) *n = ^VL„(/3° _4^). By Lemma 2 we have 

i(^n,A„ ~ Bj/Yj^) 

= e;;V2{/„ - @l/'Bi{BAB:r'Bn@l/'}@-'/'^n 

+Op{n-'/'). 

In—@n'^Bl^{BnGnBU~^Bn&l/'^ is an idempotent matrix with rank q. Hence, by a standard 
argument and condition (A2), 



0n,A.-B:iJ=O,{^). 
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We have 



and 



Combining 



and 



(- VV„(/3„,^j) = 0, for k ^ h 



/ kjkji 
^•n^n,kj^n,kji 



(34) 



< 



^Wr^MlPl.l + . . . + Wn,ksMsjy^' 
^nWn,kj'Wn,kji 

4K,,i|/30J + ... + ^„,,,J/3LJ)=^/2 

^"^/^ (1 + 0,(1)) 



(l + Op(l)) 



4(Ci)3/2 

o,((nP„)-i/^). 



(35) 



(I35p and condition q < Pn, following the proof of I3 in Theorem 3, we have 
^3 = 0,{nP^/'n-'/'q'/') = o,(l) 



Ta < n 



n 



= nP^0p{{nP^)-'/^)0p{^) 

n 

= o,{l). 

Thus, 

Qn0n,AJ - Qn{Bl%) = + 0,(1). 

It follows from Lemmas 8 and 9 of Fan and Peng (2004) that 



(36) 



n 



Hence, we have 



(37) 



The combination of (136|) and (1371) yields (!33ll . 
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Proof of Theorem 5 

Given Lemmas 3 and 4, the proof of the Theorem is similar to the proof of Theorem 4 in 
Fan and Peng (2004). 
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